@conradludgate conradludgate commented Oct 31, 2025

Motivation

As discussed on discord:

My team isn't comfortable with the performance cost of enabling tokio's tracing feature. We do actually use tracing in our application, and as such there is potentially a measurable cost to evaluating our EnvFilter every time in our performance-critical services.

Solution

Using Userspace Statically Defined Tracing (USDT) we expose lightweight probes that can be attached to at runtime with tools like bpftrace or dtrace. This is inspired by https://github.com/oxidecomputer/usdt.

The new functionality is behind a new unstable feature flag. Currently it only exposes some basic task events and not yet any resource events.

See USDT in the wild:

@github-actions github-actions bot added R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR labels Oct 31, 2025
fn task__terminate(task_id: u64) {}

fn task__waker__clone(task_id: u64) {}
fn task__waker__wake(task_id: u64) {}
Contributor

Nice idea to represent waker as wake + drop and remove the need for wake_by_ref.

Contributor Author

I actually wasn't sure if I liked that 😅

@conradludgate conradludgate force-pushed the tokio-usdt branch 2 times, most recently from 638f594 to 3eb3cde Compare October 31, 2025 15:10
@conradludgate
Contributor Author

Musings regarding usdt performance: oxidecomputer/usdt#490

@conradludgate
Contributor Author

I think I'm going to vendor the minimal subset of the code needed to support USDT on the platforms supported by the usdt crate, removing any extra dependencies in the process.

I'd like to work with the oxide devs to improve this, but currently usdt brings in a lot of unnecessary dependencies and doesn't allow cross compilation. Lastly, I want to rework the generated code to reduce any performance impact.

Because I will remove all dependencies, I'll move it back to a cfg (tokio_usdt) and remove the feature.

@hds
Contributor

hds commented Nov 6, 2025

Are we using checked-in assembly and usdt-impl, instead of the macros provided by the usdt crate, to avoid proc-macros? Or is there some other reason this is desirable?

Also, are the files in utils/usdt generated? Could we get some instructions on how they are included?

@conradludgate
Contributor Author

Are we using checked-in assembly and usdt-impl, instead of the macros provided by the usdt crate, to avoid proc-macros? Or is there some other reason this is desirable?

Also, are the files in utils/usdt generated? Could we get some instructions on how they are included?

I had a lot of issues with using usdt directly.

  1. It pulls in a lot of dependencies
  2. On macos it runs dtrace at build time
    i. It is currently impossible to cross compile with it as a dependency due to the current code architecture
  3. On linux I found that overly monomorphised probes would cause issues with linking, and the usdt crate didn't give me the tools to work around that
  4. I wanted to squeeze some more performance out of the asm.

The asm was originally generated using the proc macro, but has since been rewritten. Individual probes shouldn't need to be regenerated, but adding new probes will need some manual effort. I've tried to abstract the asm into some shared macros, though I have more work to do there. The macos code would require the most effort if new probes are needed, but I will also document it.

Some notable changes I've made to the asm are

  1. Not branching for trivial probes - e.g. wake(task_id) can just emit the NOP and skip checking if a probe is attached.
  2. Outlining the probe if otherwise monomorphised (like the poll-start probe). In this case we do keep the branch, but the branch then just calls a function to keep the instruction count small and to keep the probe confined to 1 address.
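
The second change can be sketched in plain Rust. This is only an illustration under assumptions: `POLL_START_SEMA`, `LAST_TASK_ID`, `probe_inner` and `poll_with_probe` are hypothetical names, and an ordinary atomic stands in for the `__usdt_sema_*` semaphore that a tracer would raise on attach. The generic caller keeps just a load, a compare and a call, while the probe body lives in one non-generic function.

```rust
use std::sync::atomic::{AtomicU16, AtomicU64, Ordering};

/// Stand-in for the per-probe semaphore (hypothetical name). With real
/// USDT, the tracing tool raises this counter when it attaches.
static POLL_START_SEMA: AtomicU16 = AtomicU16::new(0);

/// Records the last task id seen, so the sketch has an observable effect.
static LAST_TASK_ID: AtomicU64 = AtomicU64::new(0);

/// Outlined probe body: non-generic and never inlined, so it exists at a
/// single address no matter how many generic callers there are.
#[inline(never)]
#[cold]
fn probe_inner(task_id: u64) {
    // A real probe would place its NOP and probe metadata here.
    LAST_TASK_ID.store(task_id, Ordering::Relaxed);
}

/// Generic caller, monomorphised once per closure type. Each copy only
/// contains the semaphore check and a call to the shared probe body.
fn poll_with_probe<F: FnMut() -> bool>(task_id: u64, mut poll: F) -> bool {
    if POLL_START_SEMA.load(Ordering::Relaxed) != 0 {
        probe_inner(task_id);
    }
    poll()
}

fn main() {
    // No tracer attached: the probe body is skipped.
    assert!(poll_with_probe(7, || true));
    assert_eq!(LAST_TASK_ID.load(Ordering::Relaxed), 0);

    // Simulate a tracer attaching by raising the semaphore.
    POLL_START_SEMA.store(1, Ordering::Relaxed);
    assert!(!poll_with_probe(9, || false));
    assert_eq!(LAST_TASK_ID.load(Ordering::Relaxed), 9);
}
```

The cost when no tracer is attached is the semaphore load and a predictable branch per call site.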

What we lose by not using the usdt crate is the freebsd/illumos support. I don't have any freebsd/illumos setups available to test, and the code to support usdt on those platforms is the most complicated.

I don't consider any of these changes impossible for the usdt crate to support, but it's going to take effort and considerable rewrites of large amounts of their code. Improving the asm is relatively easy, and I'd be comfortable sacrificing the build constraints once that's done. However, I do consider the concise asm to be a blocker, since we really intend for this feature to be as close to zero overhead as possible.

@hds
Contributor

hds commented Nov 7, 2025

Thanks for the details! As you've said, I think it would be important to document these steps (maybe a README.md in the usdt directory).

For Illumos support maybe we can convince someone at Oxide to add it, since it would benefit them perhaps the most. (-:

Comment on lines +72 to +75
fn task_details_inner(task_id: u64, name: &str, file: &str, line: u32, col: u32) {
    // add nul bytes
    let name0 = [name.as_bytes(), b"\0"].concat();
    let file0 = [file.as_bytes(), b"\0"].concat();
Contributor

Suggested change
- fn task_details_inner(task_id: u64, name: &str, file: &str, line: u32, col: u32) {
-     // add nul bytes
-     let name0 = [name.as_bytes(), b"\0"].concat();
-     let file0 = [file.as_bytes(), b"\0"].concat();
+ fn task_details_inner(task_id: u64, name: &CStr, file: &CStr, line: u32, col: u32) {

Contributor

Getting the filename of a Location as a CStr will be stable in 1.92.

Contributor Author

Some concerns:

  1. This is the slow path - any conversions of &str -> &CStr should take place within this function.
  2. I would rather not require nightly for the file location. We can wait 6 weeks.
  3. Because the task name might contain a \0 somewhere in the middle, it ends up being a fair bit of code and error handling to construct such a string.
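
To illustrate point 3, here is a small sketch of what such a conversion could look like inside the slow path. `to_c_string_lossy` is a hypothetical helper, not code from this PR, and truncating at the first interior NUL is just one possible policy; it relies only on `std::ffi::CString`.

```rust
use std::ffi::CString;

/// Hypothetical helper: build a NUL-terminated copy of `s` for a probe
/// argument. An interior NUL would end the C string early for any reader,
/// so this sketch truncates there instead of returning an error.
fn to_c_string_lossy(s: &str) -> CString {
    let bytes = s.as_bytes();
    let end = bytes.iter().position(|&b| b == 0).unwrap_or(bytes.len());
    // The slice has no NUL bytes, so `CString::new` cannot fail here.
    CString::new(&bytes[..end]).expect("slice contains no NUL bytes")
}

fn main() {
    assert_eq!(to_c_string_lossy("worker").as_bytes(), b"worker");
    assert_eq!(to_c_string_lossy("wor\0ker").as_bytes(), b"wor");
    assert_eq!(to_c_string_lossy("").as_bytes_with_nul(), b"\0");
}
```

Like the `concat` version above, this allocates, which is acceptable only because it runs when a probe is actually attached.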

@Darksonn
Contributor

Darksonn commented Nov 7, 2025

Not branching for trivial probes - eg wake(task_id) can just emit the NOP and skip checking if a probe is attached.

Can you share more details about this? Do I understand correctly that this works by replacing nop instructions with a different instruction (call or jmp or similar)?

@conradludgate
Contributor Author

conradludgate commented Nov 7, 2025

Not branching for trivial probes - eg wake(task_id) can just emit the NOP and skip checking if a probe is attached.

Can you share more details about this? Do I understand correctly that this works by replacing nop instructions with a different instruction (call or jmp or similar)?

That is correct. I'll try to verify the exact instructions that get replaced (iirc linux uses an interrupt and macos uses a function call), but on both macos and linux I've observed in lldb that the generated code for wake_by_ref contains a nop instruction and some simple register ops.


I'm unable to test it right now, but the eBPF docs claim that linux uses an interrupt: https://docs.ebpf.io/linux/concepts/usdt/#attaching-with-ebpf.

When testing on aarch64-apple-darwin, I see a NOP being replaced with a FASTTRAP instruction.

@conradludgate
Contributor Author

Here are some assembly differences, ignoring any label changes:

wake_by_ref
.section .text.tokio::runtime::task::waker::wake_by_ref,"ax",@progbits
        .globl  tokio::runtime::task::waker::wake_by_ref
        .p2align        4
.type   tokio::runtime::task::waker::wake_by_ref,@function
tokio::runtime::task::waker::wake_by_ref:
        .cfi_startproc
        sub rsp, 24
        .cfi_def_cfa_offset 32
+       mov rax, qword ptr [rdi + 16]
+       mov rax, qword ptr [rax + 72]
+       mov rax, qword ptr [rdi + rax]
+       nop
        mov rax, qword ptr [rdi]
        lea rcx, [rsp + 8]
        lea rdx, [rsp + 16]
        .p2align        4
.LBB193_1:
        test al, 2
        jne .LBB193_2
        test al, 4
        jne .LBB193_4
        test al, 1
        jne .LBB193_8
        test rax, rax
        js .LBB193_14
        lea r8, [rax + 68]
        mov sil, 1
        jmp .LBB193_9
        .p2align        4
.LBB193_2:
        xor esi, esi
        mov r9, rcx
        xor r8d, r8d
        mov qword ptr [r9], r8
        cmp dword ptr [rsp + 8], 1
        je .LBB193_11
        jmp .LBB193_12
        .p2align        4
.LBB193_4:
        xor esi, esi
        mov r8, rax
        jmp .LBB193_9
.LBB193_8:
        mov r8, rax
        or r8, 4
        xor esi, esi
.LBB193_9:
        mov qword ptr [rsp + 8], 1
        mov r9, rdx
        mov qword ptr [r9], r8
        cmp dword ptr [rsp + 8], 1
        jne .LBB193_12
.LBB193_11:
        mov r8, qword ptr [rsp + 16]
        lock cmpxchg    qword ptr [rdi], r8
        jne .LBB193_1
.LBB193_12:
        test sil, sil
        je .LBB193_13
        mov rax, qword ptr [rdi + 16]
        add rsp, 24
        .cfi_def_cfa_offset 8
        jmp qword ptr [rax + 8]
.LBB193_13:
        .cfi_def_cfa_offset 32
        add rsp, 24
        .cfi_def_cfa_offset 8
        ret
.LBB193_14:
        .cfi_def_cfa_offset 32
        lea rdi, [rip + .Lanon.cd1a78ec98df3f76d56fd1466ac0099f.165]
        lea rdx, [rip + .Lanon.cd1a78ec98df3f76d56fd1466ac0099f.166]
        mov esi, 47
        call qword ptr [rip + core::panicking::panic@GOTPCREL]
poll
.section .text.tokio::runtime::task::raw::poll::hd96dfd798919c755,"ax",@progbits
	.p2align	4
.type	tokio::runtime::task::raw::poll::hd96dfd798919c755,@function
tokio::runtime::task::raw::poll::hd96dfd798919c755:
	.cfi_startproc
	.cfi_personality 155, DW.ref.rust_eh_personality
-	.cfi_lsda 27, .Lexception49
+	.cfi_lsda 27, .Lexception51
	push rbp
	.cfi_def_cfa_offset 16
	push r15
	.cfi_def_cfa_offset 24
	push r14
	.cfi_def_cfa_offset 32
	push r12
	.cfi_def_cfa_offset 40
	push rbx
	.cfi_def_cfa_offset 48
	sub rsp, 512
	.cfi_def_cfa_offset 560
	.cfi_offset rbx, -48
	.cfi_offset r12, -40
	.cfi_offset r14, -32
	.cfi_offset r15, -24
	.cfi_offset rbp, -16
	mov rbx, rdi
	call qword ptr [rip + tokio::runtime::task::state::State::transition_to_running::h51d78340084c6090@GOTPCREL]
	movzx eax, al
	lea rcx, [rip + .LJTI77_0]
	movsxd rax, dword ptr [rcx + 4*rax]
	add rax, rcx
	jmp rax
	mov rax, qword ptr [rip + tokio::runtime::task::waker::WAKER_VTABLE::h0490a160f2a7ec56@GOTPCREL]
	mov qword ptr [rsp], rax
	mov qword ptr [rsp + 8], rbx
	lea r14, [rbx + 32]
	mov rax, rsp
	mov qword ptr [rsp + 24], rax
	mov qword ptr [rsp + 32], 0
	mov qword ptr [rsp + 16], rax
	cmp dword ptr [rbx + 56], 0
	jne .LBB77_14
	mov rdi, qword ptr [rbx + 40]
	call qword ptr [rip + tokio::runtime::task::core::TaskIdGuard::enter::h31387a37d88abbe0@GOTPCREL]
-	lea rdi, [rbx + 64]
	mov qword ptr [rsp + 48], rax
+	mov r12, qword ptr [rbx + 40]
+	mov rax, qword ptr [rip + __usdt_sema_tokio_task__poll__start@GOTPCREL]
+	cmp word ptr [rax], 0
+	je .LBB77_5
+	mov rdi, r12
+	call qword ptr [rip + tokio::util::usdt::usdt_impl::task_poll_start::probe_inner::hda61800a23fcc40d@GOTPCREL]
+.LBB77_5:
+	lea rdi, [rbx + 64]
	lea rsi, [rsp + 16]
	call simple_echo_tcp::main::{{closure}}::h5671ec2a34b607f8
	mov ebp, eax
+	mov rax, qword ptr [rip + __usdt_sema_tokio_task__poll__end@GOTPCREL]
+	cmp word ptr [rax], 0
+	je .LBB77_8
+	mov rdi, r12
+	call qword ptr [rip + tokio::util::usdt::usdt_impl::task_poll_end::probe_inner::h476c2bd4bfb2eb31@GOTPCREL]
+.LBB77_8:
	lea rdi, [rsp + 48]
	call qword ptr [rip + <tokio::runtime::task::core::TaskIdGuard as core::ops::drop::Drop>::drop::hab02cca3e87fef69@GOTPCREL]
	test bpl, bpl
	je .LBB77_10
	mov rdi, rbx
	call qword ptr [rip + tokio::runtime::task::state::State::transition_to_idle::ha74e5e89ce88109b@GOTPCREL]
	movzx eax, al
	lea rcx, [rip + .LJTI77_1]
	movsxd rax, dword ptr [rcx + 4*rax]
	add rax, rcx
	jmp rax
	mov rdi, r14
	mov rsi, rbx
	call qword ptr [rip + tokio::runtime::scheduler::multi_thread::handle::<impl tokio::runtime::task::Schedule for alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>::yield_now::hf966c4e666337dba@GOTPCREL]
	mov rdi, rbx
	call qword ptr [rip + tokio::runtime::task::state::State::ref_dec::h3033e08956b2d202@GOTPCREL]
	test al, al
	je .LBB77_55
	mov rdi, rbx
	call core::ptr::drop_in_place<tokio::runtime::task::core::Cell<simple_echo_tcp::main::{{closure}},alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>>::h8c2b259dda47c6ee
	jmp .LBB77_54
	mov rdi, rbx
	call core::ptr::drop_in_place<tokio::runtime::task::core::Cell<simple_echo_tcp::main::{{closure}},alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>>::h8c2b259dda47c6ee
.LBB77_54:
	mov esi, 384
	mov edx, 128
	mov rdi, rbx
	call qword ptr [rip + __rustc[de0091b922c53d7e]::__rust_dealloc@GOTPCREL]
	jmp .LBB77_55
	lea r14, [rbx + 32]
+	mov rax, qword ptr [rip + __usdt_sema_tokio_task__terminate@GOTPCREL]
+	cmp word ptr [rax], 0
+	je .LBB77_38
+	mov rdi, qword ptr [rbx + 40]
+	mov esi, 1
+	call qword ptr [rip + tokio::util::usdt::usdt_impl::task_terminate::probe_inner::h7db442807ccea7e2@GOTPCREL]
+.LBB77_38:
	mov dword ptr [rsp + 280], 2
	lea rsi, [rsp + 280]
	mov rdi, r14
	call tokio::runtime::task::core::Core<T,S>::set_stage::h6c06c0277547f365
.LBB77_39:
	xor eax, eax
.LBB77_41:
	mov rcx, qword ptr [rbx + 40]
	mov qword ptr [rsp + 56], rcx
	mov qword ptr [rsp + 64], rax
	mov qword ptr [rsp + 72], rdx
	mov dword ptr [rsp + 48], 1
	lea rsi, [rsp + 48]
	mov rdi, r14
	call tokio::runtime::task::core::Core<T,S>::set_stage::h6c06c0277547f365
	jmp .LBB77_42
.LBB77_10:
	mov rax, qword ptr [rip + __usdt_sema_tokio_task__terminate@GOTPCREL]
	cmp word ptr [rax], 0
	je .LBB77_12
	mov rdi, qword ptr [rbx + 40]
	xor esi, esi
	call qword ptr [rip + tokio::util::usdt::usdt_impl::task_terminate::probe_inner::h7db442807ccea7e2@GOTPCREL]
.LBB77_12:
	mov dword ptr [rsp + 48], 2
	lea rsi, [rsp + 48]
	mov rdi, r14
	call tokio::runtime::task::core::Core<T,S>::set_stage::h6c06c0277547f365
	xor ecx, ecx
.LBB77_25:
	mov qword ptr [rsp + 288], rcx
	mov qword ptr [rsp + 296], rax
	mov qword ptr [rsp + 304], rdx
	mov dword ptr [rsp + 280], 1
	lea rsi, [rsp + 280]
	mov rdi, r14
	call tokio::runtime::task::core::Core<T,S>::set_stage::h6c06c0277547f365
.LBB77_42:
	mov rdi, rbx
	call tokio::runtime::task::harness::Harness<T,S>::complete::hbafcc89c840d433c
.LBB77_55:
	add rsp, 512
	.cfi_def_cfa_offset 48
	pop rbx
	.cfi_def_cfa_offset 40
	pop r12
	.cfi_def_cfa_offset 32
	pop r14
	.cfi_def_cfa_offset 24
	pop r15
	.cfi_def_cfa_offset 16
	pop rbp
	.cfi_def_cfa_offset 8
	ret
	.cfi_def_cfa_offset 560
+	mov rax, qword ptr [rip + __usdt_sema_tokio_task__terminate@GOTPCREL]
+	cmp word ptr [rax], 0
+	je .LBB77_46
+	mov rdi, qword ptr [rbx + 40]
+	mov esi, 1
+	call qword ptr [rip + tokio::util::usdt::usdt_impl::task_terminate::probe_inner::h7db442807ccea7e2@GOTPCREL]
+.LBB77_46:
	mov dword ptr [rsp + 280], 2
	lea rsi, [rsp + 280]
	mov rdi, r14
	call tokio::runtime::task::core::Core<T,S>::set_stage::h6c06c0277547f365
	jmp .LBB77_39
.LBB77_14:
	lea rax, [rip + .Lanon.5efc90b22d074bac3f60c9ed09935ae4.75]
	mov qword ptr [rsp + 48], rax
	mov qword ptr [rsp + 56], 1
	mov qword ptr [rsp + 64], 8
	xorps xmm0, xmm0
	movups xmmword ptr [rsp + 72], xmm0
	lea rsi, [rip + .Lanon.5efc90b22d074bac3f60c9ed09935ae4.76]
	lea rdi, [rsp + 48]
	call qword ptr [rip + core::panicking::panic_fmt::h5138da2ef87be35b@GOTPCREL]
	ud2
	jmp .LBB77_51
	mov rdi, rax
	call qword ptr [rip + std::panicking::catch_unwind::cleanup::h80150e146981252a@GOTPCREL]
	jmp .LBB77_41
	call qword ptr [rip + core::panicking::panic_cannot_unwind::h2df093f6b1708ee2@GOTPCREL]
	mov rdi, rax
	call qword ptr [rip + std::panicking::catch_unwind::cleanup::h80150e146981252a@GOTPCREL]
	mov r15, rax
	test rax, rax
	je .LBB77_42
	mov r12, rdx
	mov rax, qword ptr [rdx]
	test rax, rax
	je .LBB77_30
	mov rdi, r15
	call rax
.LBB77_30:
	mov rsi, qword ptr [r12 + 8]
	test rsi, rsi
	je .LBB77_42
	mov rdx, qword ptr [r12 + 16]
	mov rdi, r15
	call qword ptr [rip + __rustc[de0091b922c53d7e]::__rust_dealloc@GOTPCREL]
	jmp .LBB77_42
	mov r14, rax
	mov rsi, qword ptr [r12 + 8]
	test rsi, rsi
	je .LBB77_35
	mov rdx, qword ptr [r12 + 16]
	mov rdi, r15
	jmp .LBB77_34
	call qword ptr [rip + core::panicking::panic_cannot_unwind::h2df093f6b1708ee2@GOTPCREL]
	mov r15, rax
-	lea rdi, [rsp + 48]
-	call qword ptr [rip + <tokio::runtime::task::core::TaskIdGuard as core::ops::drop::Drop>::drop::hab02cca3e87fef69@GOTPCREL]
+	mov rax, qword ptr [rip + __usdt_sema_tokio_task__poll__end@GOTPCREL]
+	cmp word ptr [rax], 0
+	je .LBB77_19
+	mov rdi, r12
+	call qword ptr [rip + tokio::util::usdt::usdt_impl::task_poll_end::probe_inner::h476c2bd4bfb2eb31@GOTPCREL]
	jmp .LBB77_19
-       call qword ptr [rip + core::panicking::panic_in_cleanup::h8f68387bb6cbbf54@GOTPCREL]
	mov rdi, rax
	call qword ptr [rip + std::panicking::catch_unwind::cleanup::h80150e146981252a@GOTPCREL]
	jmp .LBB77_41
	call qword ptr [rip + core::panicking::panic_cannot_unwind::h2df093f6b1708ee2@GOTPCREL]
.LBB77_51:
	mov r14, rax
	mov esi, 384
	mov edx, 128
	mov rdi, rbx
.LBB77_34:
	call qword ptr [rip + __rustc[de0091b922c53d7e]::__rust_dealloc@GOTPCREL]
.LBB77_35:
	mov rdi, r14
	call _Unwind_Resume@PLT
	mov r15, rax
.LBB77_19:
-	mov dword ptr [rsp + 280], 2
-	lea rsi, [rsp + 280]
+	lea rdi, [rsp + 48]
+	call qword ptr [rip + <tokio::runtime::task::core::TaskIdGuard as core::ops::drop::Drop>::drop::hab02cca3e87fef69@GOTPCREL]
+	jmp .LBB77_22
+	call qword ptr [rip + core::panicking::panic_in_cleanup::h8f68387bb6cbbf54@GOTPCREL]
+	mov r15, rax
+.LBB77_22:
	mov rdi, r14
-	call tokio::runtime::task::core::Core<T,S>::set_stage::h6c06c0277547f365
+	call core::ptr::drop_in_place<tokio::runtime::task::harness::poll_future::{{closure}}::Guard<simple_echo_tcp::main::{{closure}},alloc::sync::Arc<tokio::runtime::scheduler::current_thread::Handle>>>::he33e022ff410fd70
	mov rdi, r15
	call qword ptr [rip + std::panicking::catch_unwind::cleanup::h80150e146981252a@GOTPCREL]
	mov rcx, qword ptr [rbx + 40]
	jmp .LBB77_25
	call qword ptr [rip + core::panicking::panic_cannot_unwind::h2df093f6b1708ee2@GOTPCREL]
	call qword ptr [rip + core::panicking::panic_in_cleanup::h8f68387bb6cbbf54@GOTPCREL]

@Darksonn
Contributor

Darksonn commented Nov 7, 2025

The docs found here say that:

SDT probes are designed to have a tiny runtime code and data footprint and no dynamic relocations.

But the assembly you shared contains relocations such as __usdt_sema_tokio_task__poll__end@GOTPCREL. Is the assembly correct?

@conradludgate
Contributor Author

The docs found here say that:

SDT probes are designed to have a tiny runtime code and data footprint and no dynamic relocations.

But the assembly you shared contains relocations such as __usdt_sema_tokio_task__poll__end@GOTPCREL. Is the assembly correct?

I think this is a necessary part of how I'm currently avoiding over-monomorphisation. Each probe can only have one semaphore, so every probe call site that checks the semaphore needs a relocation to reach that global.


I'd like to figure out why the monomorphisation is causing issues with the linker because then we could eliminate most of the semaphores entirely.

I did also play around with trying to move the poll probes higher up the stack where the runtime is still polymorphic, but I wasn't too happy with it. Maybe I can rework/reconsider it.
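
One way to picture that alternative, with the probe placed where the task is still type-erased, is the sketch below. It does not reflect tokio's actual internals: `ErasedTask`, `Task`, `poll_erased` and `fire_poll_probe` are hypothetical names, and a counter stands in for the probe. Because `poll_erased` is not generic, it is compiled once, so the probe site would exist at a single address without an outlined helper.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts probe firings; a stand-in for a real USDT probe site.
static PROBE_HITS: AtomicUsize = AtomicUsize::new(0);

fn fire_poll_probe(_task_id: u64) {
    PROBE_HITS.fetch_add(1, Ordering::Relaxed);
}

/// Type-erased view of a task (hypothetical trait).
trait ErasedTask {
    fn id(&self) -> u64;
    fn poll(&mut self) -> bool;
}

/// Concrete task, generic over its body like a spawned future would be.
struct Task<F> {
    id: u64,
    body: F,
}

impl<F: FnMut() -> bool> ErasedTask for Task<F> {
    fn id(&self) -> u64 {
        self.id
    }
    fn poll(&mut self) -> bool {
        (self.body)()
    }
}

/// Non-generic entry point: the probes wrap the dynamic call, so they are
/// emitted once rather than once per task type.
fn poll_erased(task: &mut dyn ErasedTask) -> bool {
    let id = task.id();
    fire_poll_probe(id); // poll start
    let done = task.poll();
    fire_poll_probe(id); // poll end
    done
}

fn main() {
    let mut a = Task { id: 1, body: || true };
    let mut polls = 0;
    let mut b = Task { id: 2, body: move || { polls += 1; polls >= 2 } };

    assert!(poll_erased(&mut a));
    assert!(!poll_erased(&mut b));
    assert!(poll_erased(&mut b));
    // Three polls, each firing a start and an end probe.
    assert_eq!(PROBE_HITS.load(Ordering::Relaxed), 6);
}
```

The trade-off in this sketch is that the probes sit one dynamic call away from the future's own poll, so they also measure the dispatch itself.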
